Spatial Statistics and Geospatial Data Visualization

By: Amr Mohamed

Table of contents:

  • 1: Key concepts in Geospatial data
    • 1.1: Raster and vector data
    • 1.2: Digital Elevation Models
  • 2: Interactive data visualization
    • 2.1: Interactive data visualization with Folium
    • 2.2: Interactive mapping using Kepler
    • 2.3: DBSCAN and its application to trajectories clustering
  • 3: Spatial data visualization challenges
    • 3.1: Space-time trajectory
    • 3.2: Chicago criminality
    • 3.3: LIDAR data visualization
  • 4: Spatial statistics
    • 4.1: Introduction to point pattern statistics
    • 4.2: Point pattern statistics with Preston Crime
    • 4.3: Areal statistics
  • 5: Geostatistics
Importing Packages
In [ ]:
In [8]:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
In [2]:
env: HV_DOC_HTML=true
In [ ]:

1: Key concepts in Geospatial data

1.1: Raster and vector data

Explain in your notebook the differences between raster and polygon data:

Vector data consists of points, lines, and polygons; each feature is represented by one or more vertices connected by lines. Polygons are vector features whose connected vertices form a closed shape, and they are often used to represent areas such as countries, states, and neighborhoods. Raster data, on the other hand, is pixelated data in which each pixel is associated with a specific geographical location. A pixel's value can be continuous (e.g. elevation) or categorical (e.g. land use), and the raster is accompanied by spatial information (georeferencing) that ties the grid to a particular location.
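The two representations above can be sketched in a few lines of plain numpy: a vector polygon is an ordered ring of vertices, while a raster is a value grid plus georeferencing information. The square "district" and the tiny elevation grid are hypothetical values, just to make the contrast concrete.

```python
import numpy as np

# Vector representation: a polygon is an ordered list of (x, y) vertices
# forming a closed ring (a hypothetical square "district").
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (0.0, 0.0)]

def ring_area(ring):
    # Shoelace formula: area enclosed by a closed ring of vertices.
    x = np.array([p[0] for p in ring])
    y = np.array([p[1] for p in ring])
    return 0.5 * abs(np.dot(x[:-1], y[1:]) - np.dot(x[1:], y[:-1]))

# Raster representation: a regular grid of pixels tied to a location by an
# origin and a cell size (here a 2x2 grid of continuous elevation values).
elevation = np.array([[10.0, 12.0], [11.0, 15.0]])
origin, cell_size = (0.0, 0.0), 2.0  # the georeferencing information

area = ring_area(polygon)     # 16.0, computed from the vector geometry
mean_elev = elevation.mean()  # 12.0, aggregated over raster pixels
```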

Design a non-interactive plot of French school districts based on the dataset provided by the instructor. Each district should be represented with a different color.

In [44]:
Out[44]:
name vacances wikipedia geometry
0 Académie d'Aix-Marseille Zone B fr:Académie d'Aix-Marseille MULTIPOLYGON (((4.23013 43.46039, 4.23025 43.4...
1 Académie d'Amiens Zone B fr:Académie d'Amiens (éducation) MULTIPOLYGON (((1.38014 50.06499, 1.38214 50.0...
2 Académie de Besançon Zone A fr:Académie de Besançon MULTIPOLYGON (((5.25202 46.94451, 5.25208 46.9...
3 Académie de Bordeaux Zone A fr:Académie de Bordeaux (éducation) MULTIPOLYGON (((-1.79102 43.37292, -1.79048 43...
4 Académie de Caen Zone B fr:Académie de Caen (éducation) MULTIPOLYGON (((-1.94877 49.71649, -1.94836 49...
In [45]:

Find a shapefile representing the entire world, and design a set of maps based on different projections / coordinate reference systems (CRS) (ex.: Mercator, etc.).

In [23]:
Out[23]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
In [41]:
In [40]:
In [39]:

Write down a paragraph (in your notebook only) explaining the concept of CRS

A coordinate reference system (CRS) is a framework used to precisely measure locations on the surface of the Earth as coordinates, providing a standardized way of describing where things are. Many different CRS are used to describe geographic data; which one is chosen depends on when the data was collected, its geographic extent, and its intended purpose.
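To make the idea concrete, the spherical Mercator formulas below project geographic coordinates (degrees, as in EPSG:4326) to projected meters; the same point on Earth gets very different numbers depending on the CRS. This is a minimal sketch using only the standard library; the Paris coordinates are illustrative values.

```python
import math

R = 6378137.0  # WGS 84 semi-major axis, in meters

def mercator(lon_deg, lat_deg):
    """Project lon/lat in degrees (EPSG:4326-style) to spherical-Mercator
    x/y in meters: x = R*lambda, y = R*ln(tan(pi/4 + phi/2))."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = R * lam
    y = R * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

# One place, two coordinate descriptions:
lon, lat = 2.3522, 48.8566   # Paris, in degrees (geographic CRS)
x, y = mercator(lon, lat)    # the same place, in projected meters
```

Real reprojection work would go through a library such as pyproj rather than hand-written formulas, but the computation is the same idea.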

Download a raster dataset from the SRTM website (elevation data from the entire world, +-30 meters in precision). Design a 2D map of the area you selected.

In [38]:
In [37]:
In [36]:

1.2 Digital Elevation Models

In [13]:
CGN Airport elevation (meters): 74
In [16]:
In [17]:
Creating /root/.cache/srtm
In [18]:
4 2884802
In [21]:
                Digital Elevation Model of Istra and Trieste
Out[21]:

1.2 Digital Elevation Model (Rayshader)

In [2]:
In [5]:
In [6]:

2: Interactive data visualization

2.1: Interactive data visualization with Folium

In [23]:
Out[23]:
name vacances wikipedia geometry
0 Académie d'Aix-Marseille Zone B fr:Académie d'Aix-Marseille MULTIPOLYGON (((4.23013 43.46039, 4.23025 43.4...
1 Académie d'Amiens Zone B fr:Académie d'Amiens (éducation) MULTIPOLYGON (((1.38014 50.06499, 1.38214 50.0...
2 Académie de Besançon Zone A fr:Académie de Besançon MULTIPOLYGON (((5.25202 46.94451, 5.25208 46.9...
3 Académie de Bordeaux Zone A fr:Académie de Bordeaux (éducation) MULTIPOLYGON (((-1.79102 43.37292, -1.79048 43...
4 Académie de Caen Zone B fr:Académie de Caen (éducation) MULTIPOLYGON (((-1.94877 49.71649, -1.94836 49...
In [24]:
In [25]:
C:\Users\Administrator\AppData\Local\Temp\ipykernel_12768\2399918202.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  x_map=french_schools.centroid.x.mean()
-1.3368242488745592 41.07040499085242
C:\Users\Administrator\AppData\Local\Temp\ipykernel_12768\2399918202.py:2: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  y_map=french_schools.centroid.y.mean()
In [26]:
Out[26]:
3.097.0
In [63]:
Out[63]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [64]:
C:\Users\Administrator\AppData\Local\Temp\ipykernel_12768\4257234760.py:1: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.
  phd = pd.read_csv("./../Datasets/PhD.v3.csv")
Out[64]:
Unnamed: 0 Auteur Identifiant auteur Titre Directeur de these Directeur de these (nom prenom) Identifiant directeur Etablissement de soutenance Identifiant etablissement Discipline ... Year Langue de la these Identifiant de la these Accessible en ligne Publication dans theses.fr Mise a jour dans theses.fr Discipline_prédi Genre etablissement_rec Langue_rec
0 0 Saeed Al marri NaN Le credit documentaire et l'onopposabilite des... Philippe Delebecque Delebecque Philippe 29561248 Paris 1 27361802 Driot prive ... NaN na s69480 non 26-01-12 26-01-12 Droit et Science Politique male Université Paris 1 - Panthéon Sorbonne NaN
1 1 Andrea Ramazzotti 174423705 Application de la PGD a la resolution de probl... Jean-Claude Grandidier,Marianne Beringhier Grandidier Jean-Claude,Beringhier Marianne 715,441,511 Chasseneuil-du-Poitou, Ecole nationale superie... 28024400 Mecanique des solides, des materiaux, des stru... ... NaN na s98826 non 22-11-13 22-11-13 Materiaux, Milieux et Chimie female École nationale supérieure de mécanique et d'a... NaN
2 2 OLIVIER BODENREIDER NaN Conception d'un outil informatique d'etude des... Francois Kohler Kohler Francois 57030758 Nancy 1 NaN Medecine ... 1993.0 fr 1993NAN19006 non 24-05-13 17-11-12 Medecine male Université de Lorraine Français
3 3 Emmanuel Porte NaN Socio-histoire des politiques publiques en mat... Gilles Pollet Pollet Gilles na Lyon 2 02640334X Science politique ... NaN na s88867 non 12-07-13 12-01-16 Droit et Science Politique male Université Lumière - Lyon 2 NaN
4 4 Arthur Devriendt NaN LES TECHNOLOGIES DE L'INFORMATION ET DE LA COM... Gabriel Dupuy Dupuy Gabriel na Paris 1 27361802 Geographie ... NaN na s89663 non 13-07-13 12-07-13 SHS male Université Paris 1 - Panthéon Sorbonne NaN

5 rows × 23 columns

In [65]:
Out[65]:
etablissement_rec count
0 AgroParisTech 1218
1 Agrocampus Ouest 531
2 Aix-Marseille Université 20873
3 Arts et Métiers Sciences et Technologies 857
4 Avignon Université 1080
In [66]:
Out[66]:
uai - identifiant Libellé Géolocalisation
0 0751875F École nationale supérieure d'architecture de P... 48.89325,2.38192
1 0694123G École normale supérieure de Lyon 45.7337,4.83376
2 0951214D École supérieure des sciences économiques et c... 49.03397,2.078427
3 0912408Y Université Paris-Saclay 48.70997,2.1531
4 0597139P Centrale Lille Institut 50.60648,3.13757
In [67]:
Out[67]:
uai - identifiant Libellé Géolocalisation etablissement_rec count
0 0694123G École normale supérieure de Lyon 45.7337,4.83376 École normale supérieure de Lyon 941
1 0951214D École supérieure des sciences économiques et c... 49.03397,2.078427 École supérieure des sciences économiques et c... 44
2 0912408Y Université Paris-Saclay 48.70997,2.1531 Université Paris-Saclay 24684
3 0597139P Centrale Lille Institut 50.60648,3.13757 Centrale Lille Institut 384
4 0331766R Université Bordeaux Montaigne 44.8015,-0.615058 Université Bordeaux Montaigne 2714
In [68]:
In [69]:
Out[69]:
Make this Notebook Trusted to load map: File -> Trust Notebook

2.2: Interactive mapping using Kepler

note: the maps below are screenshots of the HTML map generated by Kepler

kepler1.png

kepler2.png

2.3 DBSCAN and its application to trajectories clustering

In [46]:
In [47]:
In [48]:
In [50]:
In [51]:
In [52]:
In [53]:

Given a set of points, DBSCAN groups together points that are close to each other according to a distance measure and a minimum number of points, and marks points lying in low-density regions as outliers. It takes two main parameters: epsilon, which specifies how close points must be to each other to be considered part of the same cluster (two points whose distance is at most epsilon are neighbors), and minPoints, the minimum number of points required to form a dense region.
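The two parameters can be seen directly in scikit-learn's implementation. In this toy sketch (hypothetical coordinates, not the trajectory data above), two tight groups become clusters and an isolated point is flagged as noise with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # cluster A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # cluster B
    [10.0, 0.0],                          # isolated -> noise
])

# eps = epsilon (max neighbor distance); min_samples = minPoints.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
# labels -> [0, 0, 0, 1, 1, 1, -1]: two clusters and one noise point
```

For trajectories, the same call would be applied to the trajectory points (or to per-trajectory features), typically with a geographic distance metric instead of the default Euclidean one.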

3: Spatial data visualization challenges

3.1 Space-time trajectory

In [54]:
In [55]:
Out[55]:
event-id visible timestamp location-long location-lat algorithm-marked-outlier argos:altitude argos:best-level argos:calcul-freq argos:iq ... argos:sensor-2 argos:sensor-3 argos:sensor-4 argos:valid-location-algorithm manually-marked-outlier sensor-type individual-taxon-canonical-name tag-local-identifier individual-local-identifier study-name
0 15390742 True 2001-11-24 14:07:30.000 18.507 -33.939 NaN 20.0 -134.0 401.653355 56 ... 186 201 10 2 NaN argos-doppler-shift Ciconia ciconia 1983 1983C MPIAB White Stork South African Population Argos
1 15390745 True 2001-11-24 16:44:24.000 18.478 -33.963 NaN 20.0 -129.0 401.652946 51 ... 190 110 14 1 NaN argos-doppler-shift Ciconia ciconia 1983 1983C MPIAB White Stork South African Population Argos

2 rows × 30 columns

In [65]:
In [ ]:
Out[35]:
{1979, 1983, 20523, 20525, 20526, 20527}
In [66]:

3.2: Chicago criminality

In [74]:
ID Case.Number Date Block IUCR Primary.Type Description Location.Description Arrest Domestic ... Ward Community.Area FBI.Code X.Coordinate Y.Coordinate Year Updated.On Latitude Longitude Location
9446834 HX100098 01/01/2014 12:02:00 AM 072XX S MORGAN ST 1477 WEAPONS VIOLATION RECKLESS FIREARM DISCHARGE RESIDENTIAL YARD (FRONT/BACK) true false ... 17 68 15 1170897 1856795 2014 02/10/2018 03:50:01 PM 41.76252 -87.64920 (41.762519013, -87.64919809)
9446765 HX100013 01/01/2014 12:03:00 AM 064XX S ROCKWELL ST 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN RESIDENTIAL YARD (FRONT/BACK) true false ... 15 66 15 1160145 1861909 2014 02/10/2018 03:50:01 PM 41.77678 -87.68847 (41.776780469, -87.688465418)
9446921 HX100067 01/01/2014 12:04:00 AM 048XX S PRAIRIE AVE 1460 WEAPONS VIOLATION POSS FIREARM/AMMO:NO FOID CARD APARTMENT true false ... 3 38 15 1178859 1872997 2014 02/04/2016 06:33:39 AM 41.80680 -87.61952 (41.806801421, -87.61952336)
9446783 HX100093 01/01/2014 12:05:00 AM 011XX W 50TH ST 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN PARKING LOT/GARAGE(NON.RESID.) true false ... 16 61 15 1169689 1871646 2014 02/10/2018 03:50:01 PM 41.80330 -87.65320 (41.8032982, -87.653195042)
9446811 HX100017 01/01/2014 12:05:00 AM 031XX W WALNUT ST 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN RESIDENCE PORCH/HALLWAY true false ... 27 27 15 1155344 1901461 2014 02/10/2018 03:50:01 PM 41.88541 -87.70501 (41.885413516, -87.705005317)
9446958 HX100026 01/01/2014 12:05:00 AM 005XX E 67TH ST 0550 ASSAULT AGGRAVATED PO: HANDGUN STREET true false ... 20 42 04A 1181179 1860714 2014 02/10/2018 03:50:01 PM 41.77304 -87.61139 (41.773042565, -87.61139273)
In [75]:
In [76]:
In [77]:
In [78]:
        x       y order  hole piece id group
  1092482 1943166     1 FALSE     1  0   0.1
  1092430 1945812     2 FALSE     1  0   0.1
  1092894 1945822     3 FALSE     1  0   0.1
  1092925 1944820     4 FALSE     1  0   0.1
  1093784 1944668     5 FALSE     1  0   0.1
  1094076 1944670     6 FALSE     1  0   0.1
In [83]:
In [88]:
Scale for 'alpha' is already present. Adding another scale for 'alpha', which
will replace the existing scale.
In [89]:

3D Facetgrid of Primary types heatmaps

In [37]:
In [24]:

Lollipop plot

In [97]:
`summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
Year Primary.Type n
2014 ASSAULT 1988
2014 BATTERY 1605
2014 CRIM SEXUAL ASSAULT 69
2014 CRIMINAL SEXUAL ASSAULT 4
2014 ROBBERY 3799
2014 WEAPONS VIOLATION 2663
In [104]:

3.3 LIDAR data visualization

In [16]:

Download a sample LiDAR dataset from Google Drive. The zip file is 52.1 MB and the uncompressed LAS file is 109 MB.

In [3]:
In [4]:
Downloading...
From: https://drive.google.com/uc?id=1H_X1190vL63BoFYa_cVBDxtIa8rG-Usb
To: H:\CY Tech Year 4\Semester 1\Geospatial\Noteooks\madison.zip
100%|█████████████████████████████████████████████████████████████████████████████| 54.7M/54.7M [00:03<00:00, 14.0MB/s]
Extracting files...

Read the LiDAR data

In [5]:

The LAS header.

In [6]:
Out[6]:
<LasHeader(1.3, <PointFormat(1, 0 bytes of extra dims)>)>

The number of points.

In [7]:
Out[7]:
4068294

The list of features.

In [8]:
Out[8]:
['X',
 'Y',
 'Z',
 'intensity',
 'return_number',
 'number_of_returns',
 'scan_direction_flag',
 'edge_of_flight_line',
 'classification',
 'synthetic',
 'key_point',
 'withheld',
 'scan_angle_rank',
 'user_data',
 'point_source_id',
 'gps_time']

Inspect data.

In [9]:
Out[9]:
array([5324343, 5324296, 5323993, ..., 5784049, 5784359, 5784667])
In [10]:
Out[10]:
array([8035264, 8035347, 8035296, ..., 7550110, 7550066, 7550026])
In [11]:
Out[11]:
array([36696, 34835, 34826, ..., 36839, 36858, 36842])
In [12]:
Out[12]:
array([ 9, 41, 24, ..., 87, 80, 95], dtype=uint16)

Visualize LiDAR data using the open3d backend.

In [15]:
Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.

GIF of the pop-up window output by the code chunk above

4: Spatial statistics

4.1: Introduction to point pattern statistics

In [34]:
	Chi-squared test of CSR using quadrat counts

data:  p_cluster
X2 = 53.387, df = 24, p-value = 0.0005142
alternative hypothesis: clustered

Quadrats: 25 tiles (irregular windows)
In [35]:
	Chi-squared test of CSR using quadrat counts

data:  p_regular
X2 = 8.57, df = 24, p-value = 0.001619
alternative hypothesis: regular

Quadrats: 25 tiles (irregular windows)
In [36]:
In [37]:
In [38]:

G(r) = 1 - exp(-lambda * pi * r^2)

G(r) is the probability of a point having a nearest neighbor within a distance r (the corresponding cdf of the pdf of the NN distances).

To calculate G(r) for a pattern, we look at each event, find the distance to its nearest neighboring event, and plot a histogram of these distances to estimate the probability density function of the nearest-neighbor distribution. The corresponding cumulative distribution function is the G(r) we are looking for.
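The procedure above is a few lines of numpy. This sketch uses a simulated CSR pattern in the unit square (spatstat's Gest additionally applies border corrections, which this naive estimate skips), and compares the empirical G against the theoretical CSR curve.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(200, 2))   # simulated CSR pattern, unit square

# Pairwise distances; exclude self-distances before taking the minimum.
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nnd = d.min(axis=1)                      # nearest-neighbor distance per event

def G_hat(r):
    # Empirical G: fraction of events whose nearest neighbor lies within r.
    return (nnd <= r).mean()

# Theoretical CSR value for comparison: G(r) = 1 - exp(-lambda*pi*r^2).
lam = len(pts)                           # intensity in the unit square
r = 0.05
theo = 1 - np.exp(-lam * np.pi * r**2)
```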

Border correction in spatial statistics involves correcting for the effects of the boundaries of the region on the calculated value of G(r). This is important because the presence of boundaries can influence the distribution of points within the region, leading to inaccurate results. There are several methods that can be used to correct for these effects, including the use of periodic boundary conditions, which involve replicating the region in all dimensions to create an infinite array of identical copies. This can help to mitigate the influence of the boundaries on the calculated value of G(r).

In [16]:
In [17]:
In [18]:

14) Explain the role of the K-function and the envelope, and generally speaking how Ripley's K works

Under complete spatial randomness, K(r) = pi * r^2, the area of a circle with radius r.

Ripley's K-function is used to compare a given point distribution with a random distribution: the point pattern under investigation is tested against the null hypothesis that the points are distributed randomly and independently. K(r) is defined as the expected number of further points within distance r of a typical point of the process, scaled by the intensity.

An envelope is a tool to assess uncertainties in K-function estimates: points are repeatedly sampled from a uniform Poisson process in the area and the K-function of the simulated data is computed. This process is repeated 99 times, and the minimum and maximum values of K are taken at each distance value. If the K-function from the data goes above the top of the envelope, we have evidence for clustering; if it goes below the envelope, there is evidence for an inhibitory process causing points to be spaced out.
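The envelope construction can be sketched directly in numpy. This naive K estimate applies no edge correction (spatstat's Kest does), so it slightly underestimates K near the boundary; the point is only to show the simulate-and-take-extremes mechanics.

```python
import numpy as np

rng = np.random.default_rng(42)

def K_hat(pts, r, area=1.0):
    """Naive (uncorrected) Ripley's K: mean number of other points within
    distance r of a point, divided by the intensity."""
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    lam = n / area
    return (d <= r).sum(axis=1).mean() / lam

# Envelope at one distance r: simulate CSR 99 times, keep the min and max.
r = 0.1
sims = [K_hat(rng.uniform(0, 1, (100, 2)), r) for _ in range(99)]
lo, hi = min(sims), max(sims)
# K_hat(data, r) above `hi` suggests clustering; below `lo`, inhibition.
# Under CSR, K(r) should be close to pi*r^2 (about 0.0314 here).
```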

In [20]:

Generally speaking, if the data were generated by a Poisson process, then K(r) - pi*r^2 should be close to zero for all values of r, so we subtract pi*r^2 to verify this. In the plot above, the red line stays at zero for all values of r.

In [23]:
Generating 99 simulations of CSR  ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.

Done.
In [24]:
In [25]:
Generating 99 simulations of CSR  ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.

Done.

Real world data

In [40]:
In [43]:
In [48]:
In [96]:
In [99]:
	Chi-squared test of CSR using quadrat counts

data:  bei
X2 = 2009.9, df = 24, p-value < 2.2e-16
alternative hypothesis: clustered

Quadrats: 5 by 5 grid of tiles

From the quadrat test above, we can conclude that the tree pattern is clustered: since the p-value < 2.2e-16, we reject the null hypothesis that the data come from a uniform Poisson point process.
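The quadrat test itself is simple to reproduce: divide the window into a grid, count points per quadrat, and compare the counts to the uniform expectation with a chi-squared test. This sketch uses a simulated CSR pattern in the unit square with a 5 x 5 grid (mirroring the spatstat `quadrat.test` output above, which has df = 24), not the bei data.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(250, 2))  # simulated CSR pattern, unit square

# Count points in a 5 x 5 grid of quadrats.
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                              bins=5, range=[[0, 1], [0, 1]])
counts = counts.ravel()                 # 25 quadrat counts

# Under CSR every quadrat expects n/25 points; chi-squared with df = 24.
stat, p = chisquare(counts)
# A tiny p-value (as with the bei trees) would reject CSR; for this
# simulated uniform pattern the p-value should usually be large.
```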

4.2: Point pattern statistics with Preston Crime

In [41]:
                x                       y                       mark     
 349847,286386053:   1   425746,70161589 :   1   Non-violent crime:1812  
 349861,601260038:   1   425778,026335666:   1   Violent crime    : 224  
 349879,376771875:   1   425778,953254153:   1                           
 349888,695199284:   1   425788,761293382:   1                           
 349902,595749144:   1   425792,560420541:   1                           
 349903,203841941:   1   425793,869626957:   1                           
 (Other)         :2030   (Other)         :2030                           
In [42]:
 osm_preston_gray.1 osm_preston_gray.2 osm_preston_gray.3 osm_preston_gray.4
 Min.   :  0.0      Min.   :  0.0      Min.   :  0.0      Min.   :255       
 1st Qu.:206.0      1st Qu.:206.0      1st Qu.:206.0      1st Qu.:255       
 Median :224.0      Median :224.0      Median :224.0      Median :255       
 Mean   :219.8      Mean   :219.8      Mean   :219.8      Mean   :255       
 3rd Qu.:239.0      3rd Qu.:239.0      3rd Qu.:239.0      3rd Qu.:255       
 Max.   :254.0      Max.   :254.0      Max.   :254.0      Max.   :255       
In [45]:
Marked planar point pattern:  2036 points
Average intensity 4.053214e-05 points per square unit

Coordinates are given to 2 decimal places
i.e. rounded to the nearest multiple of 0.01 units

Multitype:
                  frequency proportion    intensity
Non-violent crime      1812  0.8899804 3.607281e-05
Violent crime           224  0.1100196 4.459332e-06

Window: polygonal boundary
single connected closed polygon with 99 vertices
enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
                     (7998 x 7999 units)
Window area = 50231700 square units
Fraction of frame area: 0.785
In [46]:
        osm_preston_gray.1 osm_preston_gray.2 osm_preston_gray.3 osm_preston_gray.4
Min.                     0                  0                  0                255
1st Qu.                206                206                206                255
Median                 224                224                224                255
3rd Qu.                239                239                239                255
Max.                   254                254                254                255
NA's                     0                  0                  0                  0
In [47]:
Non-violent crime     Violent crime 
             1812               224 
In [48]:
In [49]:
In [50]:

4.3: Areal statistics

In [51]:
In [52]:
Object of class SpatialPolygonsDataFrame
Coordinates:
       min      max
x 503574.2 561956.7
y 155850.8 200933.6
Is projected: TRUE 
proj4string :
[+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
+y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
+towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
Data attributes:
     NAME             TOTAL_POP        Electorate       Votes_Cast    
 Length:32          Min.   :157711   Min.   : 83042   Min.   : 54801  
 Class :character   1st Qu.:237717   1st Qu.:143458   1st Qu.:104079  
 Mode  :character   Median :272017   Median :168394   Median :116280  
                    Mean   :270780   Mean   :169337   Mean   :118025  
                    3rd Qu.:316911   3rd Qu.:196285   3rd Qu.:134142  
                    Max.   :379691   Max.   :245349   Max.   :182570  
     Remain           Leave       Rejected_Ballots   Pct_Remain   
 Min.   : 27750   Min.   :17138   Min.   : 60.0    Min.   :30.34  
 1st Qu.: 55973   1st Qu.:32138   1st Qu.:105.0    1st Qu.:53.69  
 Median : 70254   Median :45263   Median :138.0    Median :61.01  
 Mean   : 70631   Mean   :47255   Mean   :139.0    Mean   :60.46  
 3rd Qu.: 84287   3rd Qu.:59018   3rd Qu.:164.2    3rd Qu.:69.90  
 Max.   :118463   Max.   :96885   Max.   :267.0    Max.   :78.62  
   Pct_Leave      Pct_Rejected      Assembly        
 Min.   :21.38   Min.   :0.0600   Length:32         
 1st Qu.:30.10   1st Qu.:0.0875   Class :character  
 Median :38.99   Median :0.1100   Mode  :character  
 Mean   :39.54   Mean   :0.1187                     
 3rd Qu.:46.31   3rd Qu.:0.1500                     
 Max.   :69.66   Max.   :0.2200                     
In [53]:
  1. 'Lewisham'
  2. 'Merton'
  3. 'Newham'
  4. 'Redbridge'
  5. 'Richmond upon Thames'
  6. 'Southwark'
  7. 'Tower Hamlets'
  8. 'Waltham Forest'
  9. 'Wandsworth'
  10. 'Westminster'
  11. 'Barnet'
  12. 'Brent'
  13. 'Bromley'
  14. 'Camden'
  15. 'Croydon'
  16. 'Ealing'
  17. 'Enfield'
  18. 'Greenwich'
  19. 'Hackney'
  20. 'Hammersmith and Fulham'
  21. 'Haringey'
  22. 'Harrow'
  23. 'Hounslow'
  24. 'Islington'
  25. 'Kensington and Chelsea'
  26. 'Kingston upon Thames'
  27. 'Lambeth'
In [54]:
In [56]:
Loading required package: spData
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
Loading required package: sf
Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
Mean size error for iteration 1: 1.5881743190908
Mean size error for iteration 2: 1.30262460145333
Mean size error for iteration 3: 1.16605703069549
Mean size error for iteration 4: 1.09652671280571
Mean size error for iteration 5: 1.05979301980903
Mean size error for iteration 6: 1.03870642059676
Mean size error for iteration 7: 1.02576724388584
Mean size error for iteration 8: 1.01780225318077
Mean size error for iteration 9: 1.0124614648423
Mean size error for iteration 10: 1.00876130634916
Mean size error for iteration 11: 1.00621346049453
Mean size error for iteration 12: 1.00446957764083
Mean size error for iteration 13: 1.00322569444771
Mean size error for iteration 14: 1.00233570000671
Mean size error for iteration 15: 1.00169692422828
In [57]:
Neighbour list object:
Number of regions: 32 
Number of nonzero links: 126 
Percentage nonzero weights: 12.30469 
Average number of links: 3.9375 
In [58]:
In [59]:
In [60]:
	Monte-Carlo simulation of Moran I

data:  london_ref$Pct_Remain 
weights: nb2listw(borough_nb)  
number of simulations + 1: 1000 

statistic = 0.42841, observed rank = 999, p-value = 0.001
alternative hypothesis: greater

8) Explain how the Moran’s I distribution is made:
The Moran's I distribution is created by repeating the steps of a permutation test for the Moran's I statistic a large number of times. Specifically, the distribution is created by permuting the values of the dependent variable among the observations, calculating the Moran's I statistic for each permutation, and storing the resulting values. This creates a distribution of possible values of the Moran's I statistic under the null hypothesis of no spatial autocorrelation.
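The permutation procedure described above can be sketched in numpy. The 4-region neighbour matrix and values here are hypothetical (regions on a line, each adjacent to the next), standing in for the London borough structure; the mechanics of shuffling, recomputing I, and ranking the observed statistic are the same as in spdep's `moran.mc`.

```python
import numpy as np

def morans_I(x, W):
    """Moran's I for values x and a binary neighbour matrix W:
    I = (n / S0) * (x_c' W x_c) / (x_c' x_c), with x_c mean-centered."""
    x = x - x.mean()
    n, s0 = len(x), W.sum()
    return (n / s0) * (x @ W @ x) / (x @ x)

# Hypothetical rook-style neighbours: 4 regions on a line.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0, 4.0])   # strongly autocorrelated values

# Permutation test: shuffle x among regions, recompute I each time.
rng = np.random.default_rng(0)
I_obs = morans_I(x, W)
null = np.array([morans_I(rng.permutation(x), W) for _ in range(999)])
p = (1 + (null >= I_obs).sum()) / (1 + len(null))  # one-sided "greater"
```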

Flu data

In [62]:
In [63]:
In [64]:
0.000142412774772119
In [65]:
In [66]:
In [67]:

SMR stands for Standardized Morbidity Ratio: the number of cases per person, scaled by the overall incidence so that the expected value is 1.
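The scaling works as follows (hypothetical borough counts, not the real flu data): compute the overall incidence, multiply it by each area's population to get expected counts, and divide observed by expected. By construction the expected counts sum to the observed total, so a population-weighted SMR averages to 1.

```python
import numpy as np

# Hypothetical borough data: observed flu cases and population.
obs = np.array([30, 12, 45, 13])
pop = np.array([100_000, 80_000, 120_000, 100_000])

# Overall incidence (cases per person) across all areas.
rate = obs.sum() / pop.sum()            # 100 / 400_000 = 0.00025

# Expected cases if every area had the overall rate; SMR = observed/expected.
expected = rate * pop                   # [25, 20, 30, 25]
smr = obs / expected                    # [1.2, 0.6, 1.5, 0.52]
# smr > 1: more cases than expected; smr < 1: fewer than expected.
```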

GLMs

In [68]:
In [71]:

The yellow borough had many more cases of flu than expected; the black borough had many fewer cases.

In [39]:
	Monte-Carlo simulation of Moran I

data:  london$Flu_Resid 
weights: nb2listw(borough_nb)  
number of simulations + 1: 1000 

statistic = 0.15059, observed rank = 926, p-value = 0.074
alternative hypothesis: greater
In [49]:
Call:
glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson, 
    data = london, offset = log(TOTAL_POP))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-9.5361  -4.5285  -0.0499   2.9043   8.2194  

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -8.78190    0.02869 -306.043   <2e-16 ***
HealthDeprivation  0.65689    0.06797    9.665   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 703.75  on 31  degrees of freedom
Residual deviance: 605.03  on 30  degrees of freedom
AIC: 762.37

Number of Fisher Scoring iterations: 5
In [50]:
Call:
bayesx(formula = Flu_OBS ~ HealthDeprivation, data = data.frame(london), 
    offset = log(london$TOTAL_POP), control = bayesx.control(seed = 17610407), 
    family = "poisson")
 
Fixed effects estimation results:

Parametric coefficients:
                     Mean      Sd    2.5%     50%   97.5%
(Intercept)       -8.7825  0.0293 -8.8386 -8.7839 -8.7238
HealthDeprivation  0.6598  0.0703  0.5244  0.6581  0.8062
 
N = 32  burnin = 2000  method = MCMC  family = poisson  
iterations = 12000  step = 10  
In [51]:

7) Explain how the density of the HealthDeprivation parameter is built and what it means:
The density of the HealthDeprivation parameter is its posterior distribution: the probability distribution of the parameter given the data, estimated here from the MCMC samples drawn by bayesx.

In [52]:
In [53]:
Note: created new output directory 'C:/Users/ADMINI~1/AppData/Local/Temp/Rtmp2vdXMR/bayesx1'!
In [54]:
Call:
bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", 
    map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP), 
    control = bayesx.control(seed = 17610407), family = "poisson")
 
Fixed effects estimation results:

Parametric coefficients:
                     Mean      Sd    2.5%     50%   97.5%
(Intercept)       -9.2118  0.1011 -9.4162 -9.2146 -9.0000
HealthDeprivation  0.8790  0.4366  0.1308  0.8385  1.8318

Smooth terms variances:
            Mean     Sd   2.5%    50%  97.5%    Min    Max
sx(i):mrf 4.6137 1.6530 2.2749 4.3161 8.5822 1.5469 13.634
 
N = 32  burnin = 2000  method = MCMC  family = poisson  
iterations = 12000  step = 10  

10) By contrast with the non-spatial model, we can now see the smooth terms variances.

In [55]:
In [56]:
In [57]:
	Monte-Carlo simulation of Moran I

data:  london$spatial_resid 
weights: nb2listw(borough_nb)  
number of simulations + 1: 1000 

statistic = -0.21692, observed rank = 46, p-value = 0.954
alternative hypothesis: greater

From the test above we fail to reject the null hypothesis of no spatial autocorrelation in the residuals; since the residuals show no spatial correlation, the spatial model appears adequate.

5: Geostatistics

note: 3D visualizations in this section are screenshots of the pop-up window output by the code

Importing GemPy

In [2]:
Not subsurface compatibility available
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

Importing and creating a set of input data

The data used for the construction of a model in GemPy is stored in Python objects. The main data classes are:

-  Surface_points
-  Orientations
-  Grid
-  Surfaces
-  Series
-  Additional data
-  Faults

We will see each of these classes in further detail later.

Most of the data can also be generated from raw data that comes in the form of CSV files (CSV = comma-separated values). Such files might be obtained by exporting model data from a different program such as GeoModeller, or simply by creating them in spreadsheet software such as Microsoft Excel or LibreOffice Calc.

In this tutorial, all input data is created by importing such CSV-files. These exemplary files can be found in the input_data folder in the root folder of GemPy. The data comprises x-, y- and z-positional values for all surface points and orientation measurements. For the latter, poles, azimuth and polarity are additionally included. Surface points are furthermore assigned a formation. This might be a lithological unit such as "Sandstone" or a structural feature such as "Main Fault". It is decisive to remember that, in GemPy, interface position points mark the bottom of a layer. If such points are needed to resemble a top of a formation (e.g. when modeling an intrusion), this can be achieved by defining a respectively inverted orientation measurement.

As we generate our Data from CSV-files, we also have to define our model's real extent in x, y and z, as well as declare a desired resolution for each axis. This resolution will in turn determine the number of voxels used during modeling. Here, we rely on a medium resolution of 50x50x50, amounting to 125,000 voxels. The model extent should be chosen in a way that it contains all relevant data in a representative space. As our model voxels are not cubes, but prisms, the resolution can take a different shape than the extent. We don't recommend going much higher than 100 cells in every direction (1,000,000 voxels), as higher resolutions will become increasingly expensive to compute.
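The resolution/extent arithmetic above is worth making explicit. Using extent values consistent with the grid coordinates printed later in this notebook (cell centers from 20 to 1980 in x/y and 7.5 to 742.5 in z, so a 2000 x 2000 x 750 model), a 50x50x50 resolution gives 125,000 prism-shaped voxels whose edge lengths differ per axis:

```python
# Model extent (x_min, x_max, y_min, y_max, z_min, z_max) and resolution,
# consistent with the grid shown later in this notebook.
extent = (0, 2000, 0, 2000, 0, 750)
resolution = (50, 50, 50)

# Total voxel count grows multiplicatively with resolution.
voxels = resolution[0] * resolution[1] * resolution[2]   # 125_000

# Voxels are prisms, not cubes: the cell size differs per axis.
dx = (extent[1] - extent[0]) / resolution[0]   # 40.0
dy = (extent[3] - extent[2]) / resolution[1]   # 40.0
dz = (extent[5] - extent[4]) / resolution[2]   # 15.0
```

This is also why going much past 100 cells per axis gets expensive: 100x100x100 is already 1,000,000 voxels, eight times the cost of the 50-cell grid.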

In [3]:
In [4]:
Active grids: ['regular']
Out[4]:
Tutorial_ch1_1_Basics  2023-01-21 14:01
In [5]:
Out[5]:
  surface series order_surfaces color id
0 Shale Default series 1 #015482 1
1 Sandstone_1 Default series 2 #9f0052 2
2 Siltstone Default series 3 #ffbe00 3
3 Sandstone_2 Default series 4 #728f02 4
4 Main_Fault Default series 5 #443988 5
5 basement Basement 1 #ff3f20 6

The input data can then be listed using the command get_data. Note that the order of formations and respective allocation to series is still completely arbitrary. We will fix this in the following.

In [6]:
Out[6]:
X Y Z smooth surface
0 1000 50 450.00 2.00e-06 Shale
1 1000 150 433.33 2.00e-06 Shale
2 1000 300 433.33 2.00e-06 Shale
3 1000 500 466.67 2.00e-06 Shale
4 1000 1000 533.33 2.00e-06 Shale
In [7]:
Out[7]:
X Y Z G_x G_y G_z smooth surface
0 1000 1000 300 0.32 1.00e-12 0.95 0.01 Shale
1 400 1000 420 0.32 1.00e-12 0.95 0.01 Sandstone_2
2 500 1000 300 -0.95 1.00e-12 0.32 0.01 Main_Fault

Declaring the sequential order of geological formations

In [8]:
Out[8]:
  surface series order_surfaces color id
0 Shale Default series 1 #015482 1
1 Sandstone_1 Default series 2 #9f0052 2
2 Siltstone Default series 3 #ffbe00 3
3 Sandstone_2 Default series 4 #728f02 4
4 Main_Fault Default series 5 #443988 5
5 basement Basement 1 #ff3f20 6
In [9]:
Out[9]:
  surface series order_surfaces color id
4 Main_Fault Fault_Series 1 #443988 1
0 Shale Strat_Series 1 #015482 2
1 Sandstone_1 Strat_Series 2 #9f0052 3
2 Siltstone Strat_Series 3 #ffbe00 4
3 Sandstone_2 Strat_Series 4 #728f02 5
5 basement Strat_Series 5 #ff3f20 6
In [10]:
Out[10]:
  surface series order_surfaces color id
4 Main_Fault Fault_Series 1 #443988 1
0 Shale Strat_Series 1 #015482 2
1 Sandstone_1 Strat_Series 2 #9f0052 3
2 Siltstone Strat_Series 3 #ffbe00 4
3 Sandstone_2 Strat_Series 4 #728f02 5
5 basement Strat_Series 5 #ff3f20 6
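This grouping is typically done in GemPy with gp.map_series_to_surfaces, passing a dictionary from series names to surface names. The plain-pandas sketch below (a stand-in, not the GemPy call itself) just reproduces the resulting table and its renumbered ids:

```python
import pandas as pd

# Surface-to-series mapping as in the table above; dict insertion order
# determines the series order, list order the order within each series.
mapping = {
    "Fault_Series": ["Main_Fault"],
    "Strat_Series": ["Shale", "Sandstone_1", "Siltstone", "Sandstone_2", "basement"],
}
rows = [
    {"surface": s, "series": series, "order_surfaces": i + 1}
    for series, surfs in mapping.items()
    for i, s in enumerate(surfs)
]
surfaces = pd.DataFrame(rows)
surfaces["id"] = range(1, len(surfaces) + 1)  # global ids, fault series first
print(surfaces)
```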
In [11]:
Out[11]:
order_series BottomRelation isActive isFault isFinite
Fault_Series 1 Erosion True False False
Strat_Series 2 Erosion True False False
In [12]:
Fault colors changed. If you do not like this behavior, set change_color to False.
Out[12]:
order_series BottomRelation isActive isFault isFinite
Fault_Series 1 Fault True True False
Strat_Series 2 Erosion True False False
In [13]:
Out[13]:
Fault_Series Strat_Series
Fault_Series False True
Strat_Series False False
In [14]:
Out[14]:
order_series BottomRelation isActive isFault isFinite
Fault_Series 1 Fault True True False
Strat_Series 2 Erosion True False False
In [15]:
Out[15]:
Fault_Series Strat_Series
Fault_Series False True
Strat_Series False False

Returning information from our input data

Our model input data, here named "geo_model", contains all the information that is essential for the construction of our model. You can access different types of information by using gp.get_data or simply by accessing the attributes.

We can, for example, return the coordinates of our modeling grid via:

In [16]:
Out[16]:
Grid Object. Values: 
array([[  20. ,   20. ,    7.5],
       [  20. ,   20. ,   22.5],
       [  20. ,   20. ,   37.5],
       ...,
       [1980. , 1980. ,  712.5],
       [1980. , 1980. ,  727.5],
       [1980. , 1980. ,  742.5]])

As mentioned before, GemPy's core algorithm is based on the interpolation of two types of data:

  • surface_points
  • orientation measurements

(If you want to know more about how this interpolation algorithm works, check out our paper: https://www.geosci-model-dev.net/12/1/2019/gmd-12-1-2019.pdf.)

We introduced the function get_data above. You can also specify which kind of data you want to call by declaring the string attribute "dtype" to be either 'surface_points' (interfaces) or 'orientations'.

Interfaces Dataframe:

In [17]:
Out[17]:
X Y Z smooth surface
52 700 1000 300.0 2.00e-06 Main_Fault
53 600 1000 200.0 2.00e-06 Main_Fault
54 500 1000 100.0 2.00e-06 Main_Fault
55 1000 1000 600.0 2.00e-06 Main_Fault
56 1100 1000 700.0 2.00e-06 Main_Fault

Orientations Dataframe:

In [18]:
Out[18]:
X Y Z G_x G_y G_z smooth surface
2 500 1000 300 -0.95 1.00e-12 0.32 0.01 Main_Fault
0 1000 1000 300 0.32 1.00e-12 0.95 0.01 Shale
1 400 1000 420 0.32 1.00e-12 0.95 0.01 Sandstone_2
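The pole components G_x, G_y, G_z in the orientations table can be recomputed from dip and azimuth. The sketch below assumes the common convention of azimuth measured clockwise from north (+y); it reproduces the values in the rows above:

```python
import numpy as np

def pole_components(dip, azimuth, polarity=1):
    """Dip and azimuth in degrees -> unit pole vector (G_x, G_y, G_z).
    Assumes azimuth measured clockwise from north (+y)."""
    dip, azimuth = np.radians(dip), np.radians(azimuth)
    return (np.sin(dip) * np.sin(azimuth) * polarity,
            np.sin(dip) * np.cos(azimuth) * polarity,
            np.cos(dip) * polarity)

# Shale row above: dip ~18.4 deg towards east (azimuth 90)
gx, gy, gz = pole_components(18.435, 90)
print(round(gx, 2), round(gz, 2))   # 0.32 0.95

# Main_Fault row: dip ~71.6 deg towards west (azimuth 270)
gx, gy, gz = pole_components(71.565, 270)
print(round(gx, 2), round(gz, 2))   # -0.95 0.32
```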

Notice that now all surfaces have been assigned to a series and are displayed in the correct order (from young to old).

Visualizing input data

We can also visualize our input data. This might, for example, be useful to check if all points and measurements are defined the way we want them to be. Using the function plot_data, we attain a 2D projection of our data points onto a plane of chosen direction (we can choose this direction to be either x, y or z).

In [19]:
H:\Users\Administrator\lib\site-packages\gempy\plot\plot_api.py:261: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  p.fig.show()

Using plot_data_3D, we can also visualize this data in 3D. Note that direct 3D visualization in GemPy requires the Visualization Toolkit (VTK, https://www.vtk.org/) to be installed.

All 3D plots in GemPy are interactive. This means that we can drag and drop any data point and measurement. The perpendicular axis views in VTK are particularly useful to move points solely on a desired 2D plane. Any changes will then be stored permanently in the "InputData" dataframe. If we want to reset our data points, we will then need to reload our original input data.

Executing the cell below will open a new window with a 3D interactive plot of our data.

In [20]:
H:\Users\Administrator\lib\site-packages\pyvista\utilities\helpers.py:507: UserWarning: Points is not a float type. This can cause issues when transforming or applying filters. Casting to ``np.float32``. Disable this by passing ``force_float=False``.
  warnings.warn(

Model generation

Once we have made sure that we have defined all our primary information as desired in our object DataManagement.InputData (named geo_data in these tutorials), we can continue with the next step towards creating our geological model: preparing the input data for interpolation.

This is done by generating an InterpolatorData object (named interp_data in these tutorials) from our InputData object via the following function:

In [21]:
Setting kriging parameters to their default values.
Compiling theano function...
H:\Users\Administrator\lib\site-packages\theano\scan_module\scan_perform_ext.py:75: UserWarning: The file scan_perform.c is not available. This donot happen normally. You are probably in a strangesetup. This mean Theano can not use the cython code for scan. If youwant to remove this warning, use the Theano flag'cxx=' (set to an empty string) to disable all ccode generation.
  warnings.warn(
Level of Optimization:  fast_compile
Device:  cpu
Precision:  float64
Number of faults:  1
Compilation Done!
Kriging values: 
                     values
range              2926.17
$C_o$            203869.05
drift equations     [3, 3]
Out[21]:
<gempy.core.interpolator.InterpolatorModel at 0x15aab923eb0>

This function rescales the extent and coordinates of the original data (and stores them in the attribute geo_data_res, which behaves like a usual InputData object) and adds the mathematical parameters that are needed for conducting the interpolation. This step may take a while, as it also compiles a theano function which is required for the model computation. However, should this not be needed, we can skip it by declaring compile_theano = False in the function.

Furthermore, this preparation process assigns a number to each formation. Note that GemPy always creates a default basement formation as the last formation number. Afterwards, numbers are allocated from youngest to oldest, as defined by the sequence of series and formations. Via the property formations of our interpolation data, we can find out which number has been assigned to which formation:

The parameters used for the interpolation can be returned using the function get_kriging_parameters. These are generated automatically from the original data, but can be changed if needed. However, users should be careful when doing so if they do not fully understand the significance of these parameters.

In [54]:
Out[54]:
values
range 2926.17
$C_o$ 203869.05
drift equations [3, 3]
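These defaults can be reproduced from the model extent alone: judging from the printed values, the range is the diagonal of the extent and $C_o$ is range² / 14 / 3 (an assumption inferred from matching the numbers above, not taken from the GemPy documentation):

```python
import numpy as np

# Model extent used in this tutorial
extent = np.array([0., 2000., 0., 2000., 0., 750.])
dx, dy, dz = extent[1::2] - extent[::2]

range_ = np.sqrt(dx**2 + dy**2 + dz**2)  # diagonal of the model box
c_o = range_**2 / 14 / 3                 # assumed default covariance at the origin

print(round(range_, 2))  # 2926.17
print(round(c_o, 2))     # 203869.05
```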

At this point, we have all we need to compute our full model via compute_model. By default, this will return two separate solutions in the form of arrays. The first gives information on the lithological formations, the second on the fault network in the model. These arrays consist of two subarrays as entries each:

  1. Lithology block model solution:

    • Entry [0]: This array shows what kind of lithological formation is found in each voxel, as indicated by a respective formation_number.
    • Entry [1]: Potential field array that represents the orientation of lithological units and layers in the block model.
  2. Fault network block model solution:

    • Entry [0]: Array in which all fault-separated areas of the model are represented by a distinct number contained in each voxel.
    • Entry [1]: Potential field array related to the fault network in the block model.

Below, we illustrate these different model solutions and how they can be used.

In [55]:
In [56]:
Out[56]:
Lithology ids 
  [   6.    6.    6. ... -100. -100. -100.] 
In [57]:
Out[57]:
Lithology ids 
  [   6.    6.    6. ... -100. -100. -100.] 

Direct model visualization in GemPy

Model solutions can be easily visualized in 2D sections in GemPy directly. Let's take a look at our lithology block:

In [58]:
H:\Users\Administrator\lib\site-packages\gempy\plot\plot_api.py:261: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  p.fig.show()

With cell_number=25 and remembering that we defined our resolution to be 50 cells in each direction, we have chosen a section going through the middle of our block. We have moved 25 cells in direction='y'; the plot thus depicts a plane parallel to the x- and z-axes. Setting plot_data=True, we could plot the original data together with the results. By changing the values of cell_number and direction, we can move through our 3D block model and explore it via different 2D planes.
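Slicing such a section out of the flat solution array can be sketched with plain NumPy. The (nx, ny, nz) index order is an assumption, and the lithology ids here are synthetic stand-ins for the real solution:

```python
import numpy as np

# A flat array of lithology ids, as compute_model returns it (synthetic here)
resolution = (50, 50, 50)
lith_block = np.random.randint(1, 7, size=np.prod(resolution)).astype(float)

# Reshape to the grid resolution, then slice a plane through the middle
block_3d = lith_block.reshape(resolution)  # assumed (nx, ny, nz) index order
section_y = block_3d[:, 25, :]             # cell_number=25 in direction='y'

print(section_y.shape)  # (50, 50) -> an x-z slice
```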

We can do the same with our lithological scalar-field solution:

In [59]:
In [60]:

This illustrates well the fold-related deformation of the stratigraphy, as well as the way the layers are influenced by the fault.

The fault network modeling solutions can be visualized in the same way:

In [61]:
Out[61]:
array([[0.03075848, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.83598092, 0.80357372, 0.77174354, 0.72471042]])
In [62]:
In [63]:

Marching cubes and vtk visualization

In addition to 2D sections, we can extract surfaces to visualize in 3D renderers. Surfaces can be visualized as 3D triangle complexes in VTK (see the function plot_surfaces_3D below). To create these triangles, we need to extract the respective vertices and simplices from the potential fields of lithologies and faults. This process is automated in GemPy with the function get_surface.
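The vertex/simplex extraction itself is the classic marching-cubes algorithm. A minimal scikit-image sketch (a stand-in for GemPy's internal machinery, run on a synthetic spherical scalar field rather than a real potential field):

```python
import numpy as np
from skimage import measure

# Synthetic scalar field: signed distance to a sphere of radius 10,
# sampled on a 33x33x33 grid with unit spacing
x, y, z = np.mgrid[-16:16:33j, -16:16:33j, -16:16:33j]
field = np.sqrt(x**2 + y**2 + z**2) - 10.0

# Vertices and simplices (triangles) of the level-0 isosurface
verts, faces, normals, values = measure.marching_cubes(field, level=0.0)
print(verts.shape, faces.shape)  # (n_vertices, 3), (n_triangles, 3)
```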

In [64]:
H:\Users\Administrator\lib\site-packages\pyvista\plotting\tools.py:622: PyVistaDeprecationWarning: The usage of `parse_color` is deprecated in favor of the new `Color` class.
  warnings.warn(
H:\Users\Administrator\lib\site-packages\pyvista\utilities\helpers.py:507: UserWarning: Points is not a float type. This can cause issues when transforming or applying filters. Casting to ``np.float32``. Disable this by passing ``force_float=False``.
  warnings.warn(


Using the rescaled interpolation data, we can also run our 3D VTK visualization in an interactive mode which allows us to alter and update our model in real time. Similarly to the interactive 3D visualization of our input data, the changes are permanently saved (in the InterpolationInput.dataframe object). Additionally, the resulting changes in the geological models are re-computed in real time.

Adding topography

In [65]:
Active grids: ['topography']
Out[65]:
Grid Object. Values: 
array([[   0.        ,    0.        ,  674.04326366],
       [   0.        ,   40.81632653,  651.65388764],
       [   0.        ,   81.63265306,  633.17839754],
       ...,
       [2000.        , 1918.36734694,  727.36375607],
       [2000.        , 1959.18367347,  711.62311727],
       [2000.        , 2000.        ,  699.1870782 ]])
In [66]:
H:\Users\Administrator\lib\site-packages\gempy\core\solution.py:180: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  self.geological_map = np.array(
H:\Users\Administrator\lib\site-packages\gempy\plot\plot_api.py:261: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  p.fig.show()
H:\Users\Administrator\lib\site-packages\pyvista\plotting\tools.py:622: PyVistaDeprecationWarning: The usage of `parse_color` is deprecated in favor of the new `Color` class.
  warnings.warn(
H:\Users\Administrator\lib\site-packages\pyvista\utilities\helpers.py:507: UserWarning: Points is not a float type. This can cause issues when transforming or applying filters. Casting to ``np.float32``. Disable this by passing ``force_float=False``.
  warnings.warn(


Compute at a given location

This is done by changing the grid to a custom grid and recomputing. Notice that the results are given at the grid + surface_points_ref + surface_points_rest locations.
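A sketch of that result layout, with made-up point counts (n_custom, n_ref and n_rest are hypothetical, for illustration only):

```python
import numpy as np

# The returned value array concatenates the custom-grid locations with the
# surface-point locations (ref and rest), in that order.
n_custom, n_ref, n_rest = 1, 30, 23  # hypothetical sizes
values = np.arange(n_custom + n_ref + n_rest, dtype=float)

# Only the first n_custom entries belong to our custom point(s) x_i
values_at_custom = values[:n_custom]
print(values_at_custom)
```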

In [52]:
Active grids: []

Therefore, if we just want the value at x_i:

In [53]:
Out[53]:
array([array([[6.]]), array([[0.18630133],
                             [0.63163565]])], dtype=object)

This returns the id and the scalar field values for each series.